Babbel Challenge

Simulation

Here I use the ThompsonSampling() and SlotMachine() classes to simulate a Thompson Sampling run for 10k steps.
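The actual classes live in the repo; a minimal Beta-Bernoulli sketch of such a run might look like the following (the class names mirror the ones above, but the Beta(1, 1) priors and the success probabilities are assumptions for illustration):

```python
import random

class SlotMachine:
    """A Bernoulli bandit arm with a fixed (hidden) success probability."""
    def __init__(self, p):
        self.p = p

    def pull(self):
        return 1 if random.random() < self.p else 0

class ThompsonSampling:
    """Beta-Bernoulli Thompson Sampling, one Beta(1, 1) prior per arm."""
    def __init__(self, n_arms):
        self.alpha = [1] * n_arms  # 1 + observed successes
        self.beta = [1] * n_arms   # 1 + observed failures

    def select_arm(self):
        # Draw one sample from each arm's posterior, play the argmax.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

# Hypothetical success probabilities for 5 machines.
machines = [SlotMachine(p) for p in (0.1, 0.2, 0.3, 0.4, 0.5)]
agent = ThompsonSampling(len(machines))
for _ in range(10_000):
    arm = agent.select_arm()
    agent.update(arm, machines[arm].pull())
```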

Evolution of posterior distributions

Here I plot the posterior distributions for the 5 slot machines. Since the plot is an interactive animation, the time axis is literally time.

(Probably best viewed as html file in output/distributions_evolution.html)

Evolution of best point estimate

Here I plot the means and stds of the beta distributions over time.

The best point estimate at each point in time is the mean with the highest value. The stds give a measure of the agent's confidence.
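The plotted means and stds follow directly from the Beta parameters; a small helper (the function name is mine, not from the repo) could compute them as:

```python
from math import sqrt

def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, sqrt(var)

# A flat Beta(1, 1) prior: mean 0.5, std sqrt(1/12), i.e. maximal uncertainty.
# As evidence accumulates (e.g. Beta(50, 50)) the std shrinks toward 0.
```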

(p on the y axis, time on the x axis)

(Probably best viewed as html file in output/distributions_timeline.html)

Regret

Here I plot the regret accumulated over all the steps.

(regret on the y axis, time on the x axis)
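Taking regret to mean the usual expected cumulative regret (at each step, the gap between the best arm's true success probability and the chosen arm's), a sketch of the curve computation might be (function name is hypothetical):

```python
def cumulative_regret(true_probs, chosen_arms):
    """Expected cumulative regret curve for a sequence of pulled arms.

    true_probs: true success probability of each arm.
    chosen_arms: index of the arm pulled at each step.
    """
    p_best = max(true_probs)
    total, curve = 0.0, []
    for arm in chosen_arms:
        total += p_best - true_probs[arm]  # per-step expected gap
        curve.append(total)
    return curve
```

The curve flattens once the agent locks onto the best arm, since pulling it adds zero regret.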

Questions

How well does the algorithm discover the true individual success probabilities?

Not very well.

The algorithm focuses most of the time on the best slot(s) and converges to a good estimate for those, but it neglects the other slots and never arrives at a good estimate for them.

How does this relate to the probabilities themselves, and why?

The slots with lower probabilities are neglected because of the exploration-exploitation tradeoff. Once the agent is confident enough that the best slot is not among them, it focuses almost exclusively on the best ones, exploiting them.

Can you think of a simple modification of the algorithm to improve the accuracy of the estimation of the success probabilities? (Consider this a bonus question.)

To explore more, we could simply sample arms uniformly instead of from the Beta posteriors (or just cycle through the slots). Doing so we would explore much more and converge to better estimates for all the slots, but we would exploit less.
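A middle ground between the two extremes is an epsilon-greedy-style mix (my suggestion, not from the original code): with some small probability pull a uniformly random arm, otherwise do the usual Thompson draw. The function name and the eps value are assumptions:

```python
import random

def select_arm_with_exploration(alpha, beta, eps=0.1):
    """With probability eps, pull a uniformly random arm (pure exploration);
    otherwise draw posterior samples and play the argmax (Thompson step)."""
    if random.random() < eps:
        return random.randrange(len(alpha))
    samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    return samples.index(max(samples))
```

With eps = 0 this reduces to plain Thompson Sampling; with eps = 1 it is the pure uniform explorer described above. The cost is a regret floor proportional to eps, since a fixed fraction of pulls goes to suboptimal arms forever.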